{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Customizing Buckaroo\n",
    "Buckaroo consists of\n",
    "* The BuckarooWidget which coordinates updates to the frontend, management of analysis, the lowcode UI, and auto_cleaning functionality.\n",
    "* The frontend JS Table - handles display of dataframes, configured through interfaces.\n",
    "* The pluggable analysis framework which orders execution of customized analysis objects, and handles catching errors along with error reporting.\n",
    "* User supplied analyis objects, these operate on the dataframes to build the summary stats table, and configure the frontend display\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import buckaroo\n",
    "from buckaroo.buckaroo_widget import BuckarooWidget\n",
    "\n",
    "#df = pd.read_csv(\"https://s3.amazonaws.com/tripdata/201401-citibike-tripdata.zip\")\n",
    "df = pd.read_parquet(\"./citibike-trips-2016-04.parq\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**These docs need updating for 0.5** Take a look at the [customizations](https://github.com/paddymul/buckaroo/tree/main/buckaroo/customizations) directory in the codebase and file some bugs asking for your suggested improvement.  I expect to add a lot more xamples around the 0.6 series"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "# Adding a summary stat\n",
    "Buckaroo is completely customizeable.  In the next cells we will add `Variance` to an instance of the BuckarooWidget with the `Pluggable Analysis Framework`.\n",
    "\n",
    "## Why was the Pluggable Analysis Framework built?\n",
    "The `Pluggable Analysis Framework` is engineered to allow summary_stats to be built up piecemeal and incrementally.  Traditionally when writing bits of analysis code, the tendency is to have large brittle functions that do a lot at once.  Adding extra stats either requires copying and pasting the existing function with one small addition, writing each stat independently and possibly recomputing existing stats, having a strictly ordered set of analysis functions, or some complex adhoc argument passing scheme.  I have written adhoc versions in each of these patterns.  Problems are manifest and the aparatus rarely survives even copy-pasting to the next notebook.\n",
    "\n",
    "## How does the Pluggable Analysis Framework work?\n",
    "The `Pluggable Analysis Framework` is built around a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of `ColAnalysis` nodes that can depend (`requires_summary`) on other summary stats which `provides_summary` the right values, and provide one or more summary stats.  Nodes cand be added to the dag with `add_analysis`.  If a class with the same name is inserted into the DAG, the newly inserted node replaces the previous instantiation.  This all facilitates interactive development of analysis functions.  During execution errors are caught and execution proceeds.  This is important because breaking the default dataframe mechanism is a show stopping problem for users"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "w = BuckarooWidget(df[:500])\n",
    "w"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo.pluggable_analysis_framework.col_analysis import (ColAnalysis)\n",
    "from buckaroo.dataflow.styling_core import StylingAnalysis\n",
    "\n",
    "class Variance(StylingAnalysis):\n",
    "    #note we also override pinned rows so you can see this new stat, this is mainly used for development. \n",
    "    # Normally you would want to extend \"ColAnalysis\" to just provide the stat\n",
    "    pinned_rows = [\n",
    "        {'primary_key_val': 'variance', 'displayer_args': {'displayer': 'obj' }}]\n",
    "\n",
    "    provides_summary = [\"variance\"]\n",
    "    #a bit hacky, the newly added analyis needs to be the last in the dependency chain\n",
    "    requires_summary = [\"histogram\"]\n",
    "\n",
    "    @staticmethod\n",
    "    def series_summary(sampled_ser, ser):\n",
    "        if pd.api.types.is_numeric_dtype(ser):\n",
    "            return dict(variance=ser.var())\n",
    "        return dict(variance=\"NA\")\n",
    "\n",
    "w.add_analysis(Variance)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "## Basic Unit testing is built in\n",
    "\n",
    "Because there are so many corner cases with numerical code, every time a new summary stat is added, a variety of simple tests are run against it.  This lets you discover bugs earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#broken as 0f 0.6\n",
    "\n",
    "small_df = df[:500][df.columns[:4]]\n",
    "# we are going to create, but not display a BuckarooWidget here, we are looking at the error behavior\n",
    "w = BuckarooWidget(small_df, debug=True)\n",
    "\n",
    "class Variance(ColAnalysis):\n",
    "    provides_summary = [\"variance\"]\n",
    "    requires_summary = [\"mean\"]\n",
    "    \n",
    "    @staticmethod\n",
    "    def summary(sampled_ser, summary_ser, ser):\n",
    "        mean = summary_ser.get('mean', False)\n",
    "        arr = ser.to_numpy()\n",
    "        #toggle SIMULATED_BUG to easily see behavior with and without a bug\n",
    "        SIMULATED_BUG = False\n",
    "        if SIMULATED_BUG:\n",
    "            if mean in [pd.NA, np.nan, False]:\n",
    "                return dict(variance=\"NA\")\n",
    "        else:\n",
    "            if mean is pd.NA or mean is np.nan or mean is False:\n",
    "                return dict(variance=\"NA\")\n",
    "        if mean and pd.api.types.is_integer_dtype(ser):\n",
    "            return dict(variance=np.mean((arr - mean)**2))\n",
    "        elif mean and pd.api.types.is_float_dtype(ser):\n",
    "            return dict(variance=np.mean((arr - mean)**2))\n",
    "        return dict(variance=\"NA\")\n",
    "\n",
    "w.add_analysis(Variance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF\n",
    "Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## Reproducing errors in the notebook\n",
    "Buckaroo printed reproduction instructions like\n",
    "```\n",
    "from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF\n",
    "Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous\n",
    "\n",
    "```\n",
    "\n",
    "`PERVERSE_DF` is a DataFame with all kinds of edgecases that normally trip up numerical code.  You can run the above two lines, and quickly start iterating on your `ColAnalysis` class to fix the error.  Normally adhoc analysis code that iterates over a list of functions blows up in a stack trace referencing an anonymous function in the middle of a for loop called with opaque variables.  Bucakroo gives you a single line that can reproduce the error, with easily inspectable variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF\n",
    "Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "## Quiet mode\n",
    "Sometimes you just want to get on with it.  Buckaroo has a setting for that too, set `quiet=True` and unit test errors, and regular processing errors will be silenced.  Not recommended, but if I didn't add it, users would write their own adhoc version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "w = buckaroo.BuckarooWidget(small_df)\n",
    "#There are errors in the following functions, quiet = True will ignore them\n",
    "\n",
    "def int_digits(n):\n",
    "    if np.isnan(n):\n",
    "        return 1\n",
    "    if n == 0:\n",
    "        return 1\n",
    "    if np.sign(n) == -1:\n",
    "        return int(np.floor(np.log10(np.abs(n)))) + 2\n",
    "    return int(np.floor(np.log10(n)+1))\n",
    "class MinDigits(ColAnalysis):\n",
    "    \n",
    "    requires_summary = [\"min\"]\n",
    "    provides_summary = [\"min_digits\"]\n",
    "    quiet = True\n",
    "    \n",
    "    @staticmethod\n",
    "    def summary(sampled_ser, summary_ser, ser):\n",
    "        is_numeric = pd.api.types.is_numeric_dtype(sampled_ser.dtype)\n",
    "        if is_numeric:\n",
    "            return {\n",
    "                'min_digits':int_digits(summary_ser.loc['min'])}\n",
    "        else:\n",
    "            return {\n",
    "                'min_digits':0}\n",
    "w.add_analysis(MinDigits)\n",
    "w"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Making a new default dataframe display function"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Adding a Command to the Low Code UI\n",
    "Previous versions of Buckaroo included a customizable low code UI.  This is temporarily deprecated as of 0.6\n",
    "Look at https://github.com/paddymul/buckaroo/blob/86df365278ac6933f7266c0b055a2ff90b072e9a/example-notebooks/Customizing-Buckaroo.ipynb for more info and install buckaroo at 0.4.6 or 0.5.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo.widget_utils import disable\n",
    "from IPython.core.getipython import get_ipython\n",
    "from IPython.display import display\n",
    "import warnings\n",
    "\n",
    "disable()\n",
    "def my_display_as_buckaroo(df):\n",
    "    w  = BuckarooWidget(df, showCommands=False)\n",
    "    #the analysis we added throws warnings, let's muffle that when used as the default display\n",
    "    warnings.filterwarnings('ignore')\n",
    "    w.add_analysis(Skew)\n",
    "    warnings.filterwarnings('default')\n",
    "    return display(w)\n",
    "\n",
    "def my_enable():\n",
    "    \"\"\"\n",
    "    Automatically use buckaroo to display all DataFrames\n",
    "    instances in the notebook.\n",
    "\n",
    "    \"\"\"\n",
    "    ip = get_ipython()\n",
    "    if ip is None:\n",
    "        print(\"must be running inside ipython to enable default display via enable()\")\n",
    "        return\n",
    "    ip_formatter = ip.display_formatter.ipython_display_formatter\n",
    "    ip_formatter.for_type(pd.DataFrame, my_display_as_buckaroo)\n",
    "my_enable()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
